Classifying Very-High-Dimensional Data with Random Forests of Oblique Decision Trees

Authors

  • Thanh-Nghi Do
  • Philippe Lenca
  • Stéphane Lallich
  • Nguyen-Khang Pham
Abstract

Random forests are among the most successful ensemble methods. However, they do not perform well on very-high-dimensional data in the presence of dependencies between variables. In this setting one can expect many informative combinations of variables to exist, and the usual random forests method does not exploit them effectively. We investigate a new approach to supervised classification with a very large number of numerical attributes: random oblique decision trees. The method randomly chooses a subset of predictive attributes and uses an SVM as the split function over these attributes. We compare, on 25 datasets, the effectiveness of random forests of random oblique decision trees with SVMs and random forests of C4.5, using classical measures (e.g., precision, recall, F1-measure, and accuracy). Our proposal performs significantly better on very-high-dimensional datasets and slightly better on lower-dimensional datasets.

Affiliations

  • Thanh-Nghi Do: Institut Telecom; Telecom Bretagne, UMR CNRS 3192 Lab-STICC, Université européenne de Bretagne, France; Can Tho University, Vietnam
  • Philippe Lenca: Institut Telecom; Telecom Bretagne, UMR CNRS 3192 Lab-STICC, Université européenne de Bretagne, France
  • Stéphane Lallich: Université de Lyon, Laboratoire ERIC, Lyon 2, France
  • Nguyen-Khang Pham: IRISA, Rennes, France; Can Tho University, Vietnam
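The idea in the abstract (each split is a linear SVM trained on a random subset of attributes, and many such trees vote) can be illustrated with a minimal pure-Python sketch. Several choices here are assumptions, not details from the paper: labels are restricted to -1/+1, the SVM is a simple hinge-loss subgradient solver standing in for whatever solver the authors used, each tree is grown on a bootstrap sample, and all function names and parameters (`train_linear_svm`, `build_tree`, `n_sub`, `max_depth`, and so on) are hypothetical.

```python
import random

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_linear_svm(X, y, rng, epochs=20, lr=0.1, lam=0.01):
    """Hinge-loss subgradient descent (a stand-in solver); y in {-1, +1}."""
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * (dot(w, X[i]) + b) < 1
            w = [wj * (1 - lr * lam) for wj in w]  # L2 shrinkage
            if violated:  # step toward satisfying the margin
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def majority(labels):
    return 1 if sum(labels) >= 0 else -1

def build_tree(X, y, n_sub, rng, depth=0, max_depth=5):
    if depth == max_depth or len(set(y)) == 1:
        return {"leaf": majority(y)}
    # Oblique split: a hyperplane over a random subset of attributes.
    feats = rng.sample(range(len(X[0])), n_sub)
    Xs = [[row[j] for j in feats] for row in X]
    w, b = train_linear_svm(Xs, y, rng)
    side = [dot(w, r) + b >= 0 for r in Xs]
    if all(side) or not any(side):  # degenerate split: stop here
        return {"leaf": majority(y)}
    children = {}
    for name, flag in (("left", False), ("right", True)):
        idx = [i for i, s in enumerate(side) if s == flag]
        children[name] = build_tree([X[i] for i in idx], [y[i] for i in idx],
                                    n_sub, rng, depth + 1, max_depth)
    return {"feats": feats, "w": w, "b": b, **children}

def predict_tree(node, x):
    while "leaf" not in node:
        score = dot(node["w"], [x[j] for j in node["feats"]]) + node["b"]
        node = node["right"] if score >= 0 else node["left"]
    return node["leaf"]

def build_forest(X, y, n_trees=15, n_sub=3, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Assumption: bootstrap resampling per tree, as in standard random forests.
        boot = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in boot], [y[i] for i in boot],
                                 n_sub, rng))
    return forest

def predict_forest(forest, x):
    return majority([predict_tree(tree, x) for tree in forest])

# Toy data (synthetic, for illustration only): two well-separated Gaussian blobs.
data_rng = random.Random(42)
y = [1 if i % 2 == 0 else -1 for i in range(100)]
X = [[data_rng.gauss(2.0 * label, 0.5) for _ in range(6)] for label in y]
forest = build_forest(X, y, n_trees=15, n_sub=3, seed=1)
acc = sum(predict_forest(forest, x) == t for x, t in zip(X, y)) / len(X)
print(f"training accuracy: {acc:.2f}")
```

The contrast with a C4.5-style tree is in `build_tree`: instead of thresholding a single attribute, each node separates the data with a hyperplane over several attributes at once, which is what lets the tree use combinations of variables.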

Similar resources

Hybrid weighted random forests for classifying very high-dimensional data

Random forests are a popular classification method based on an ensemble of a single type of decision trees from subspaces of data. In the literature, there are many different types of decision tree algorithms, including C4.5, CART, and CHAID. Each type of decision tree algorithm may capture different information and structure. This paper proposes a hybrid weighted random forest algorithm, simul...

On Oblique Random Forests

Abstract. In his original paper on random forests, Breiman proposed two different decision tree ensembles: one generated from “orthogonal” trees with thresholds on individual features in every split, and one from “oblique” trees separating the feature space by randomly oriented hyperplanes. In spite of a rising interest in the random forest framework, however, ensembles built from orthogonal tr...

Stratified sampling for feature subspace selection in random forests for high dimensional data

For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for rand...

Data Mining of High Accuracy for the Efficiency in the Task of Massive Printing

Random forests are known to be robust for missing and erroneous data as well as irrelevant features. Moreover, even though the forests have many trees, they can utilize the fast building property of decision trees, so they do not require much computing time. In this paper an efficient procedure that utilizes random forests to predict the cylinder bands in rotogravure printing is shown. Even tho...

Regression Leaf Forest: A Fast and Accurate Learning Method for Large & High Dimensional Data Sets, by Sivanesan Ganesan

There are a number of learning methods that provide solutions to classification and regression problems, including Linear Regression, Decision Trees, KNN, and SVMs. These methods work well in many applications, but they are challenged for real world problems that are noisy, nonlinear or high dimensional. Furthermore, missing data (e.g., missing historical features of companies in stock data), i...

Publication date: 2009